Perceptual harmonic cepstral coefficients as the front-end for speech recognition

ABSTRACT

Pitch estimation and classification into voiced, unvoiced and transitional speech were performed by a spectro-temporal auto-correlation technique. A peak picking formula was then employed. A weighting function was then applied to the power spectrum. The harmonics weighted power spectrum underwent mel-scaled band-pass filtering, and the log-energy of the filter's output was discrete cosine transformed to produce cepstral coefficients. A within-filter cubic-root amplitude compression was applied to reduce amplitude variation without compromise of the gain invariance properties.

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. Provisional Patent Application No. 60/237,285, which application is herein incorporated by reference.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

[0002] This invention was made with Government support under Grant No. IIS-9978001, awarded by the National Science Foundation. The Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

[0003] 1. Field of the Invention

[0004] The field of the invention is both noisy and clean speech recognition.

[0005] 2. Description of Related Art

[0006] Most modern speech recognition systems focus on the short-term speech spectrum for feature extraction, also referred to as “front-end” analysis. The technique attempts to capture information on the vocal tract transfer function from the gross spectral shape of the input speech, while eliminating as much as possible the irrelevant effects of excitation signals. However, the accuracy and robustness of the speech representation may deteriorate dramatically due to the spectral distortion caused by additive background noise. Noise-robust feature extraction therefore poses a great challenge in the design of high performance automatic speech recognition (ASR) systems. Over the last several decades, a number of speech spectral representations have been developed, among which the mel-frequency cepstral coefficients (MFCC) have become most popular. [M. J. Hunt, “Spectral signal processing for ASR”, Proc. ASRU'99, December 1999 and S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuous spoken sentences”, IEEE Trans. Acoust., Speech, Signal Processing, pp. 357-366, vol. 28, August 1980]. The MFCCs, though adopted by most ASR systems for their superiority in clean speech recognition, do not cope well with noisy speech. The alternative perceptual linear prediction (PLP) coefficients promise improvement over MFCC in noisy conditions by incorporating perceptual features of the human auditory mechanism. Nevertheless, it is believed that the existing front ends are sub-optimal, and the discovery of new noise-immune or noise-insensitive features is needed.

[0007] Two problems plague conventional MFCC front-end analysis techniques. The first is concerned with the vocal tract transfer function, whose accurate description is crucial to effective speech recognition. However, the irrelevant information of excitation signals must be removed for accurate spectral representation. In the MFCC approach, a smoothed version of the short-term speech spectrum is computed from the output energy of a bank of filters, i.e., the spectrum envelope is computed from energy averaged over each mel-scaled filter. While such a procedure is fast and efficient, it is inaccurate, as the vocal tract transfer function information is known to reside in the spectral envelope, which is mismatched with the smoothed spectrum, especially for voiced sounds and transitional speech. Alternative approaches based on direct spectral envelope estimation have been reported. [H. K. Kim and H. S. Lee, “Use of spectral autocorrelation in spectral envelope linear prediction for speech recognition”, IEEE Trans. Speech and Audio Processing, vol. 7, no. 5, pp. 533-541, 1999].

[0008] Moreover, the spectrum envelope tends to have much higher signal to noise ratio (SNR) than smoothed spectrum under the same noise conditions, which leads to a more robust representation of the vocal tract transfer function. Hence, speech features derived from the spectral envelope are expected to provide better performance in noisy environments compared with traditional front ends based on smoothed spectrum [Q. Zhu and A. Alwan, “AM-demodulation of speech spectra and its application to noise robust speech recognition”, Proc. ICSLP'2000, October 2000]. Thus, the MFCC approach may not work well for voiced sounds with quasi-periodic features, as the formant frequencies tend to be biased toward pitch harmonics, and formant bandwidth may be misestimated. Experiments show that this mismatch substantially increases the feature variance within the same utterance.

[0009] Another difficulty encountered in conventional acoustic analysis (e.g., MFCC) is that of appropriate spectral amplitude transformation for higher recognition performance. The log power spectrum representation in MFCC is clearly attractive because of its gain-invariance properties and the approximate Gaussian distributions it thus provides. Cubic root representation is used in the PLP representation for psychophysical considerations, at the cost of compromising the level-invariance properties and hence robustness. [H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech”, J. Acoust. Soc. America, pp. 1738-1752, vol. 87, no. 4, April 1990].

[0010] Modern speech recognition systems retrieve information on the vocal tract transfer function from the gross spectral shape. The speech signal is generated via modulation by an excitation signal that is quasi-periodic for voiced sounds, and white noise for unvoiced sounds. A typical approach, employed in MFCC and PLP, is to compute the energy output of a bank of band-pass mel-scaled or bark-scaled filters, whose bandwidths are broad enough to remove fine harmonic structures caused by the quasi-periodic excitation of voiced speech. The efficiency and effectiveness of these spectral smoothing approaches led to their popularity. However, there are three drawbacks that significantly deteriorate their accuracy.

[0011] The first drawback is the limited ability to remove undesired harmonic structures. In order to maintain adequate spectral resolution, the standard filter bandwidth in MFCC and PLP is usually in the range of 200 Hz-300 Hz in the low frequency region. It is hence sufficiently broad for typical male speakers, but not broad enough for high pitch (up to 450 Hz) female speakers. Consequently, the formant frequencies are biased towards pitch harmonics and their bandwidth is misestimated.

[0012] The second drawback concerns information extraction to characterize the vocal tract function. It is widely agreed in the speech coding community that it is the spectral envelope and not the gross spectrum that represents the shape of the vocal tract [M. Jelinek, et al., supra]. Although the smoothed spectrum is often similar to the spectral envelope of unvoiced sounds, the situation is quite different in the case of voiced and transitional sounds. Experiments show that this mismatch substantially increases the spectrum variation within the same utterance. This phenomenon is illustrated in FIG. 1 with the stationary part of the voiced sound [a]. FIG. 1 demonstrates that the upper envelope of the power spectrum sampled at pitch harmonics is nearly unchanged, while the variation of the lower envelope is considerable. The conventional smoothed spectrum representation may be roughly viewed as averaging the upper and lower envelopes. It therefore exhibits much more variation than the upper spectrum envelope alone.

[0013] The third drawback is the high spectral sensitivity to background noise. The conventional smoothed spectrum representation may be roughly viewed as averaging the upper and lower envelopes. It therefore exhibits much lower SNR than the upper spectrum envelope alone in noisy conditions.

[0014] Although some of the loss caused by the imprecision of spectrum smoothing may be compensated for and masked by higher complexity statistical modeling, the recognition rate eventually reaches saturation at high model complexity. The present invention discloses that the sub-optimality of the front-end is currently a major performance bottleneck of powerful, high complexity speech recognizers. Therefore, the present invention discloses the alternative of Harmonic Cepstral Coefficients (HCC), as a more accurate spectral envelope representation.

BRIEF SUMMARY OF THE INVENTION

[0015] The present invention overcomes the above shortcomings, and is inspired by ideas from speech coding. [M. Jelinek and J. P. Adoul, “Frequency-domain spectral envelope estimation for low rate coding of speech”, Proc. ICASSP'99, pp. 253-256, 1999]. Rather than average the energy within each filter, which results in a smoothed spectrum as in MFCC, the harmonic cepstral coefficients (HCC) are derived for voiced speech from the spectrum envelope sampled at harmonic locations. They are similar to MFCC for unvoiced sounds and silence. The extraction of HCCs requires accurate and robust pitch estimation. The present invention uses the spectro-temporal auto-correlation (STA) method for accurate and robust pitch estimation that was previously developed for sinusoidal speech coders. [Y. D. Cho, M. Y. Kim and S. R. Kim, “A spectrally mixed excitation (SMX) vocoder with robust parameter determination”, Proc. ICASSP'98, pp. 601-604, 1998]. The STA pitch estimation is based on weighted summation of the temporal and spectral auto-correlation values, and efficiently reduces multiple and sub-multiple pitch errors.

[0016] The computed (weighted) correlation is further useful for voiced-unvoiced-transitional speech detection. For voiced speech, the harmonic locations are derived from the estimated pitch information, and a peak-picking formula is employed to find the actual harmonic points near the predicted positions. For transitional speech, a fixed pitch is used within the peak-picking process. The resulting harmonic spectrum is put through mel-scaled band-pass filters and transformed into cepstrum by the discrete cosine transform. The HCC representation is further improved by applying the intensity-loudness power-law within each filter, i.e., applying the cubic-root amplitude compression within each filter, along with logarithmic energy across filters, to reduce the spectral amplitude variation within each filter without degradation of the gain-invariance properties. The resulting features form the “perceptual” HCC (PHCC) representation. Due to the psychophysical intensity-loudness power law, the spectral amplitude variation within each filter is reduced, without degradation of the desired gain-invariance properties, as the filter energy levels are still represented in logarithmic scale. Experiments with the Mandarin digit and the E-set databases show that PHCC significantly outperforms conventional MFCC for both voiced and unvoiced speech.

[0017] In another embodiment of the present invention, the PHCC front end is extended for speech recognition in noisy environments by incorporating several “anti-noise” techniques. A weight function is designed for the computation of the harmonic weighted spectrum to mitigate the distortion of harmonic structures caused by background noise. The weight function depends on the prominence of harmonic structure in the frequency domain, instead of the voiced/unvoiced/transitional classification. The power spectrum is lower clipped prior to amplitude or root-power compression to reduce the noise sensitivity associated with small spectral values and to enhance SNR. The root-power function is adjustable to the noisy environment characteristics. Experiments with the Mandarin digit database under varied noisy environments show that PHCC does provide significant improvement over conventional MFCC under noisy conditions.

[0018] In yet a further embodiment of the present invention, a new split-band PHCC (SB-PHCC) approach was used to enhance and extend PHCC via split-band spectral analysis. The speech spectrum is split, at a cutoff frequency, into two spectral bands corresponding to harmonic and non-harmonic components. The harmonic weighted spectrum is used in the harmonic band, and traditional smoothed spectrum is adopted for the non-harmonic band. The cutoff frequency selection is optimized by maximizing the average voicing strength ratio of harmonic to non-harmonic bands. Experiments with Mandarin digit and E-set databases show that SB-PHCC significantly outperforms plain PHCC and yields greater gains over conventional MFCC.

[0019] These and other features, aspects, and advantages of the present invention will become better understood with regard to the following detailed description, claims and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0020] FIG. 1 is a graph illustrating the power spectrum curves (512-point Fast Fourier transform (FFT)) for 5 consecutive frames in speech segment [a];

[0021] FIG. 2 is a block diagram of HCC analysis;

[0022] FIG. 3 is a schematic representing PHCC speech analysis;

[0023] FIG. 4 is a flowchart of PHCC analysis followed in noisy speech recognition; and

[0024] FIG. 5 is a flowchart of split-band PHCC (SB-PHCC) analysis.

DETAILED DESCRIPTION OF THE INVENTION

[0025] Perceptual Harmonic Cepstral Coefficient Computation (PHCC)

[0026] PHCC computation is similar to that of MFCC except that it attempts to closely approximate the spectral envelope sampled at pitch harmonics. The procedure was tested in a clean speech environment and comprised the following steps:

[0027] 1) The speech frame was processed by FFT to obtain the short-term power spectrum;

[0028] 2) Robust pitch estimation and voiced/unvoiced/transitional (V/UV/T) classification were performed (the spectro-temporal auto-correlation (STA) was used followed by the peak-picking formula);

[0029] 3) Class-dependent harmonic weighting was applied to obtain the harmonics weighted spectrum (HWS). For voiced and transitional speech, HWS was dominated by the harmonic spectrum (i.e., upper envelope of the short-term spectrum). For unvoiced sounds, HWS became equivalent to the conventional smoothed spectrum.

[0030] 4) Mel-scaled filters were applied to the HWS and the log energy output was computed and transformed into cepstrum by the discrete cosine transform (DCT).

[0031] A block diagram of HCC is shown in FIG. 2. Steps 2) to 4) are herein explained in greater detail.
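The final part of step 4), turning log filter energies into cepstral coefficients, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the examples: the function name `dct_cepstrum`, the unnormalized DCT-II convention, and the four log-energy values are all hypothetical, since the text only states that the log energies are transformed by a DCT.

```python
import math

def dct_cepstrum(log_energies, num_coeffs):
    """Unnormalized DCT-II of log filter-bank energies (assumed convention)."""
    m = len(log_energies)
    return [
        sum(e * math.cos(math.pi * k * (i + 0.5) / m)
            for i, e in enumerate(log_energies))
        for k in range(num_coeffs)
    ]

# Hypothetical log energies for a 4-filter bank.
log_e = [1.0, 2.0, 3.0, 4.0]
cep = dct_cepstrum(log_e, 3)

# Adding the same constant to every log energy (a gain change in the
# input) moves only the zeroth coefficient.
shifted = dct_cepstrum([e + 5.0 for e in log_e], 3)
```

With this convention, a uniform offset in the log energies affects only the zeroth coefficient, which is one way to see why a logarithmic filter-energy scale gives gain-invariant higher-order cepstral coefficients.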

[0032] Robust Pitch Estimation by the Spectro-temporal Auto-correlation (STA) Formula

[0033] Spectral envelope representation required robust pitch estimation. Minor errors were easily corrected by the peak-picking formula, described herein. However, errors due to pitch multiples or sub-multiples greatly corrupt the HWS computation for voiced speech frames. To mitigate the latter error types, the present invention used the STA formula that was first proposed for the design of harmonic speech coders. [Y. D. Cho, et al., supra].

[0034] Temporal auto-correlation (TA) has been traditionally used for pitch estimation. Given a speech signal s_(i)(n), the TA criterion for candidate pitch τ is defined as

$R^{T}(\tau) = \frac{\sum_{n=0}^{N-\tau-1} \left[ \tilde{s}_{i}(n) \cdot \tilde{s}_{i}(n+\tau) \right]}{\sqrt{\sum_{n=0}^{N-\tau-1} \tilde{s}_{i}^{2}(n) \cdot \sum_{n=0}^{N-\tau-1} \tilde{s}_{i}^{2}(n+\tau)}}$

[0035] where $\tilde{s}_{i}(n)$ is the zero-mean version of s_(i)(n), and N is the number of samples for pitch estimation. The pitch estimate was obtained by maximizing TA. Unfortunately, TA occasionally selects pitch multiples, especially when the speech signal is highly periodic with a short pitch period. This error is disastrous for the purposes of the present invention as it corrupts the estimated harmonic spectral envelope. Spectral auto-correlation (SA) was proposed to circumvent the pitfall of pitch multiples, and is defined as:

$R^{S}(\tau) = \frac{\int_{0}^{\pi-\omega_{\tau}} \tilde{S}_{f}(\omega)\, \tilde{S}_{f}(\omega+\omega_{\tau})\, d\omega}{\sqrt{\int_{0}^{\pi-\omega_{\tau}} \tilde{S}_{f}^{2}(\omega)\, d\omega \cdot \int_{0}^{\pi-\omega_{\tau}} \tilde{S}_{f}^{2}(\omega+\omega_{\tau})\, d\omega}}$

[0036] where ω_(τ)=2π/τ, S_(f)(ω) is the magnitude spectrum of s_(i)(n), and $\tilde{S}_{f}(\omega)$ is the zero-mean version of S_(f)(ω).

[0037] Clearly, the danger with SA is of pitch sub-multiples. To mitigate both error types, STA was defined as an average criterion:

$R(\tau) = \beta \cdot R^{T}(\tau) + (1 - \beta) \cdot R^{S}(\tau)$

[0038] where β=0.5 was found to yield good results in practice. [Y. D. Cho, et al., supra].
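The three criteria above can be sketched in a few lines. This is an illustrative sketch only, under several assumptions that are not in the text: the integral in the SA criterion is approximated by a sum over DFT bins, the shift ω_(τ) is rounded to the nearest bin, a slow naive DFT stands in for the FFT, and all function names are hypothetical.

```python
import cmath
import math

def _zero_mean(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def _normalized_corr(a, b):
    """Normalized cross-correlation of two equal-length sequences."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def temporal_autocorr(s, tau):
    """R^T(tau): zero-mean frame correlated with itself shifted by tau samples."""
    z = _zero_mean(s)
    return _normalized_corr(z[:len(z) - tau], z[tau:])

def magnitude_spectrum(s):
    """Magnitude of a naive DFT for bins 0..N/2 (slow, for illustration)."""
    n = len(s)
    return [abs(sum(s[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_autocorr(mag, n_fft, tau):
    """R^S(tau): zero-mean magnitude spectrum correlated with itself shifted
    by omega_tau = 2*pi/tau, i.e. about n_fft / tau bins (rounded)."""
    shift = round(n_fft / tau)
    z = _zero_mean(mag)
    return _normalized_corr(z[:len(z) - shift], z[shift:])

def sta(s, tau, beta=0.5):
    """R(tau) = beta * R^T(tau) + (1 - beta) * R^S(tau)."""
    mag = magnitude_spectrum(s)
    return (beta * temporal_autocorr(s, tau)
            + (1 - beta) * spectral_autocorr(mag, len(s), tau))

# Synthetic frame: four harmonics of a 20-sample pitch period.
period = 20
frame = [sum(math.cos(2 * math.pi * h * n / period) for h in range(1, 5))
         for n in range(200)]
r_t_20, r_t_40 = temporal_autocorr(frame, 20), temporal_autocorr(frame, 40)
r_20, r_40 = sta(frame, 20), sta(frame, 40)
```

On this synthetic frame the temporal term alone cannot separate the true period (20 samples) from its multiple (40 samples), since both give a correlation of essentially 1, whereas the combined criterion scores the true period higher.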

[0039] The STA criterion R(τ) was also used to perform V/UV/T detection. If R(τ)>α_(v), the speech frame is classified as voiced, if R(τ)<α_(u), it is classified as unvoiced, and if α_(v)≧R(τ)≧α_(u), it is declared transitional. The two thresholds can be determined based on experiments. While the thresholds were α_(v)=0.8 and α_(u)=0.5 in the present invention, the optimal value of these thresholds can vary between about 1≧α_(v)≧0.5 and 0.5≧α_(u)≧0.3, respectively, for different recognition tasks and noise environments.
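The threshold rule can be written directly as a small function. This is a sketch with a hypothetical function name; it assumes that the value passed in is the STA criterion evaluated at the estimated pitch, which the text does not state explicitly.

```python
def classify_frame(r_value, alpha_v=0.8, alpha_u=0.5):
    """Return the V/UV/T class for one frame from its STA value."""
    if r_value > alpha_v:
        return "voiced"
    if r_value < alpha_u:
        return "unvoiced"
    # alpha_u <= r_value <= alpha_v
    return "transitional"

labels = [classify_frame(r) for r in (0.9, 0.65, 0.3)]
```

Values exactly equal to either threshold fall into the transitional class, matching the closed interval in the rule above.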

[0040] The Peak-picking Formula

[0041] In the case of voiced speech frames, a more accurate determination of the harmonic frequencies was obtained by applying the peak-picking formula to the power spectrum, which corrected minor pitch estimation errors or non-integer pitch effects. The initial estimated harmonics obtained from STA were refined by looking for local maxima in a search interval that excluded neighboring harmonics. Once the peaks were found, the power spectrum value at pitch harmonics was given emphasis by appropriate weighting, as discussed next.

[0042] The peak-picking formula was also useful for transitional speech frames, as they contain some quasi-harmonic structures. Since there are no well-defined initial harmonic frequencies, they were set to fixed values (multiples of 100 Hz were quite effective in the examples).
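A minimal sketch of such a refinement is given below. The text does not specify the search interval, so the window half-width used here (narrower than half the harmonic spacing, so that a window cannot reach a neighboring harmonic) is an assumption, as are the function name and the toy spectrum.

```python
def pick_harmonic_peaks(power, f0_bin):
    """Refine predicted harmonic bins k * f0_bin to nearby spectral maxima.

    power  : power spectrum samples, indexed by DFT bin
    f0_bin : predicted pitch frequency expressed in (possibly fractional) bins
    """
    max_bin = len(power) - 1
    # Assumed window: total width does not exceed the harmonic spacing.
    half = max(0, int((f0_bin - 1) / 2))
    peaks = []
    k = 1
    while round(k * f0_bin) <= max_bin:
        center = round(k * f0_bin)
        lo = max(0, center - half)
        hi = min(max_bin, center + half)
        peaks.append(max(range(lo, hi + 1), key=lambda b: power[b]))
        k += 1
    return peaks

# Toy spectrum: flat floor with peaks slightly off the predicted positions.
power = [1.0] * 40
for b in (11, 19, 31):
    power[b] = 50.0
peaks = pick_harmonic_peaks(power, 10.0)
```

For a transitional frame, the same function could be called with `f0_bin` set to the bin position of 100 Hz for the sampling rate and FFT size in use.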

[0043] Harmonics Weighted Spectrum (HWS)

[0044] Spectral envelope representation as above has been previously proposed and is currently used in harmonic speech coding, where the spectrum amplitude sampled at pitch harmonics is vector quantized. However, the number of harmonics varies significantly from speaker to speaker (a problem that led to growing interest in variable dimension vector quantization). This also implies that some processing must be applied to the harmonic spectrum prior to its applicability to speech recognition. We propose to use the harmonics weighted energy output of mel-scale filters instead of the harmonic spectrum directly.

[0045] In the case of voiced speech, the most important information available about the spectral envelope is captured by the spectrum sampled at pitch harmonic frequencies. If the spectrum between pitch harmonics is smooth, interpolation methods can be used to retrieve the spectrum spline, albeit with high sensitivity to pitch estimation errors. Instead, the present invention disclosed a different approach called harmonics weighted spectrum (HWS) estimation. Given S_(f)(ω), the magnitude spectrum of input speech, HWS is defined as:

$\mathrm{HWS}(\omega) = w_{h}(\omega) \cdot S_{f}(\omega)$

where

$w_{h}(\omega) = \begin{cases} W_{H}, & \omega \text{ is a pitch harmonic} \\ 1, & \text{otherwise} \end{cases}$

[0046] As shown in FIG. 2, the filter log-energy is calculated from the HWS and followed by DCT to generate the cepstral coefficients.

[0047] In our simulations, W_(H) was set to 100 for voiced sounds and 10 for transitional sounds. The HWS of voiced speech reflected the spectrum spline at harmonic points. In the case of unvoiced speech, HWS was simply the power spectrum. The HWS of transitional speech represented the power spectrum with emphasis on quasi-harmonic points. Therefore, when combined with mel-scaled band-pass filtering, HWS was effectively used to extract parameters that characterize the spectral envelope for the three classes of speech frames.
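The class-dependent weighting can be illustrated as follows. This is a sketch only: the function name and the toy spectrum are hypothetical, the spectrum is assumed to be indexed by DFT bin, and the harmonic bins are assumed to come from a prior peak-picking step. The weights 100 and 10 are the values quoted above; a weight of 1 for unvoiced frames leaves the spectrum unchanged.

```python
# W_H per frame class; 1.0 for unvoiced frames leaves the spectrum as is.
CLASS_WEIGHTS = {"voiced": 100.0, "transitional": 10.0, "unvoiced": 1.0}

def harmonics_weighted_spectrum(spectrum, harmonic_bins, frame_class):
    """Multiply spectral samples at harmonic bins by the class weight W_H."""
    w_h = CLASS_WEIGHTS[frame_class]
    harmonic = set(harmonic_bins)
    return [w_h * s if k in harmonic else s for k, s in enumerate(spectrum)]

toy = [1.0, 2.0, 3.0, 4.0]
hws_voiced = harmonics_weighted_spectrum(toy, [1, 3], "voiced")
hws_unvoiced = harmonics_weighted_spectrum(toy, [1, 3], "unvoiced")
```

Because the harmonic samples are weighted so heavily for voiced frames, a subsequent filter-bank energy is dominated by the values at the harmonics, which is how the HWS approximates the upper spectral envelope.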

[0048] Perceptual Harmonic Cepstral Coefficients

[0049] A. Within-filter Amplitude Compression

[0050] It is widely recognized that auditory properties can be exploited to improve automatic speech recognition. Perhaps the most notable example is the common use of band-pass filters of broader bandwidth at high frequencies, according to the frequency resolution of the human ear. MFCC implements this by mel-scaled spacing, and PLP employs critical-band spectral resolution. Another important aspect, the perceptual transformation of the spectrum amplitude, is handled in radically different ways by the leading front-end systems. PLP applies the equal-loudness curve and the intensity-loudness power law to better exploit knowledge about the auditory system, but requires scale normalization, which was experimentally found critical for the overall recognition performance. MFCC sacrifices some perceptual precision and circumvents this difficulty by approximating the auditory curve with a logarithmic function that offers the elegant level-invariance properties.

[0051] In an attempt to enjoy the best of both worlds, the present invention applied a novel approach: the intensity-loudness power law (e.g., cubic-root amplitude compression) was used within each filter, and the log energy was computed over all filters. Hence,

Ŝ(ω)=[S(ω)]^(1/3)

Ê_(i) = log(E_(i)), 1≦i≦M

[0052] where Ŝ(ω) is the compressed spectrum and Ê_(i) is the log energy for band-pass filter i. The resulting spectrum representation significantly reduced the amplitude variation within each filter, without degradation of the gain-invariance properties and, since the filter energy levels were still represented in logarithmic scale, without recourse to normalization.
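The gain-invariance claim can be checked numerically with a small sketch. The assumptions here are labeled plainly: the filter energy is taken to be a weighted sum of the compressed spectral samples, the two rectangular filters are hypothetical stand-ins for mel-scaled filters, and the function name is illustrative.

```python
import math

def log_filter_energies(spectrum, filters, exponent=1.0 / 3.0):
    """Compress each spectral sample, then take the log of each filter's
    weighted sum of the compressed samples."""
    compressed = [s ** exponent for s in spectrum]
    return [math.log(sum(w * c for w, c in zip(filt, compressed)))
            for filt in filters]

spectrum = [1.0, 8.0, 27.0, 64.0]
filters = [[1.0, 1.0, 0.0, 0.0],   # hypothetical low band
           [0.0, 0.0, 1.0, 1.0]]   # hypothetical high band

base = log_filter_energies(spectrum, filters)
louder = log_filter_energies([8.0 * s for s in spectrum], filters)
offsets = [b - a for a, b in zip(base, louder)]
```

Scaling the input spectrum by a gain g scales every compressed sample by g^(1/3), so each log energy shifts by the same constant, (1/3)·log(g). Every filter sees an identical offset, which is why the cubic-root compression inside the filters does not disturb gain invariance.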

[0053] B. Perceptual Harmonic Cepstrum Coefficients

[0054] The above perceptual amplitude transformation was incorporated within the HCC framework to obtain the proposed perceptual HCC (PHCC), as is shown in FIG. 3. Note: the within-filter amplitude compression reduced envelope corruption damage caused by pitch harmonic errors in the case of voiced sounds, and decreased amplitude variation due to white-noise in unvoiced sounds. It thus improved the accuracy and robustness of spectral envelope estimation.

EXAMPLE 1

[0055] To test the performance of PHCC, experiments were first carried out on a database of speaker-independent isolated Mandarin digits collected in an office environment. The recognition task consists of 11 pronunciations representing 10 Mandarin digits from 0 to 9, with 2 different pronunciations for the digit “1” ([i] and [iao]). The database includes 150 speakers (75 male and 75 female), one utterance per speaker. Of the 150 speakers, 60 male and 60 female speakers were selected at random for training, and the remaining 30 speakers were set aside for the test set.

[0056] In our examples, 26-dimension speech features were used, including 12 cepstral (MFCC or PHCC) parameters, log energy, and their dynamics (time derivatives). We used an analysis frame of width 30 ms and step of 10 ms, and a Hamming window. A 9-state tied-mixture HMM was used with 99 single Gaussian pdfs. The experiment results for PHCC and MFCC are summarized in Table 1.

TABLE 1. Test-set error rate based on PHCC and MFCC for speaker-independent isolated Mandarin digit recognition

            Male     Female   Male & Female
MFCC        0.5%     3.0%     2.1%
PHCC        0.2%     1.4%     1.1%

[0057] TABLE 2. Test-set error rate based on PHCC and MFCC for English E-set recognition

Acoustic Models   7-state CHMM   13-state CHMM   21-state TMHMM
MFCC              15.3%          11.0%           7.3%
PHCC              12.2%          9.0%            6.2%

[0058] Table 1 shows that the error rate has been decreased by about 50% for both male and female speakers, and demonstrates the consistent superiority of PHCC over speakers with differing pitch levels. The main source of errors in recognizing Mandarin digits is the confusion between vowels such as [a] and [e]. This is where the spectral envelope based PHCC substantially outperforms conventional MFCC, hence the significant gains observed.

[0059] To critically test the performance of PHCC on unvoiced sounds, experiments were further carried out on OGI's E-set database. The recognition task is to distinguish between nine confusable English letters {b, c, d, e, g, p, t, v, z}, where the vowels are of minimal significance to the classification task. The database was generated by 150 speakers (75 male and 75 female) and includes one utterance per speaker. The results are summarized in Table 2.

[0060] PHCC achieved better results than MFCC over a range of acoustic model complexities, and offers over 15% error reduction relative to MFCC. As the utterances in the E-set database mainly differ in the unvoiced sounds, the improvement is attributed to the new perceptual amplitude transformation and the handling of transition sounds in the harmonic spectrum estimation.

EXAMPLE 1 Results

[0061] The proposed harmonic cepstral coefficients (HCC) offer a representation of the spectral envelope based on the harmonic spectrum, which is a weighted version of the power spectrum that emphasizes pitch harmonics. The weighting function depends on the frame's V/UV/T classification. In order to exploit both the psychophysical and gain-invariance properties of PLP and MFCC, respectively, the method employs within-filter cubic root amplitude compression and logarithmic level-scaled band-pass filtering. Experiments on the Mandarin digit and E-set databases show substantial performance gains of PHCC over MFCC. Future work will focus on the extension of PHCC to perceptual harmonic linear prediction.

[0062] We tested PHCC both on the OGI E-set database and the speaker-independent isolated Mandarin digit database to compare with standard MFCC. On E-set, with 7-state continuous HMMs, the test set recognition rate increased from about 84.7% (MFCC) to 87.8% (PHCC), i.e., 20% error rate reduction. With 21-state tied-mixture HMMs (TMHMM), the accuracy improved from about 92.7% to 93.8%, i.e., 15% error rate reduction. For the Mandarin digit database, the error rate based on 9-state TMHMMs is decreased from about 2.1% to 1.1%, which translates into a considerable 48% error rate reduction.
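The quoted error rate reductions are relative reductions with respect to the MFCC baseline. The arithmetic can be reproduced with the short sketch below, which uses the error rates from Tables 1 and 2; the function name is illustrative.

```python
def relative_error_reduction(baseline_err, new_err):
    """Relative reduction in error rate, in percent of the baseline."""
    return 100.0 * (baseline_err - new_err) / baseline_err

# (MFCC error %, PHCC error %) pairs taken from Tables 1 and 2.
eset_7_state = relative_error_reduction(15.3, 12.2)
eset_21_state = relative_error_reduction(7.3, 6.2)
mandarin_digits = relative_error_reduction(2.1, 1.1)
```

Rounded to the nearest percent, these give the 20%, 15% and 48% figures stated above.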

[0063] PHCC Computation in Noisy Speech Environments

[0064] PHCC was also extended to noisy speech recognition. To achieve this goal, several anti-noise modifications were applied to our previously proposed PHCC method. The procedure comprised the following steps:

[0065] 1) The speech frame is processed by DFT to obtain the short-term power spectrum;

[0066] 2) The intensity-loudness power law is applied to the original spectrum to obtain the root-power compressed spectrum;

[0067] 3) Robust pitch estimation and voiced/unvoiced/transition (V/UV/T) classification are performed (we employ the spectro-temporal auto-correlation (STA) followed by the peak-picking formula);

[0068] 4) Class-dependent harmonic weighting is applied to obtain the harmonics weighted spectrum (HWS). For voiced and transitional speech, HWS is dominated by the harmonic spectrum (i.e., upper envelope of the short-term spectrum). For unvoiced sounds, HWS degenerates to the conventional smoothed spectrum.

[0069] 5) Mel-scaled filters are applied to the HWS and the log energy output is computed and transformed into cepstrum by the discrete cosine transform (DCT).

[0070] A flowchart of PHCC computation is shown in FIG. 4. Steps 3 and 4 are explained in greater detail herein.

[0071] Modified Weight Function for the HWS

[0072] The advantages of spectral envelope representation over conventional smoothed spectrum representation are less obvious in noisy environments. On the one hand, the harmonic spectrum estimation discards the variations in the valleys between harmonic locations caused by the background noise, which leads to more robust spectral representation. On the other hand, the original harmonic structure in voiced and transitional speech may be blurred significantly by the input additive noise, especially in high frequency regions. A solution to these problems calls for a more effective weight function for the HWS.

[0073] Here we propose a modified weight function for HWS estimation in noisy environments. A new parameter, harmonic confidence, is defined as

$H_{a} = \max_{\tau} R(\tau),$

[0074] where R(τ) is the spectro-temporal autocorrelation criterion.

[0075] The harmonic weight w_(h)(ω) used in the HWS definition is now modified to

$w_{h}(\omega) = \begin{cases} \max\left(1, e^{(H_{a} - \eta)\gamma}\right), & \text{if } \omega \leq \omega_{T} \text{ is a pitch harmonic} \\ 1, & \text{otherwise,} \end{cases}$

[0076] where ω_(T) is the cut-off frequency. In the modified HWS computation, the harmonic-based spectral envelope is emphasized in the low frequency zone below ω_(T), whose harmonic structure is more robust to additive noise. The conventional smoothed spectrum is retained in the high frequency zone above ω_(T). In addition, the weight value depends on the harmonic confidence H_(a), to account for the effect of noise signals, where η is the harmonic confidence threshold, and γ is the weight factor. In our experiment, ω_(T), η and γ were set to 2.5 kHz, 0.5 and 10, respectively.
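The modified weight can be sketched as a single function. The defaults are the experimental values quoted above (2.5 kHz, 0.5 and 10); the function name, the argument names, and the choice to express frequency in Hz are assumptions made for illustration.

```python
import math

def modified_harmonic_weight(freq_hz, is_harmonic, h_a,
                             cutoff_hz=2500.0, eta=0.5, gamma=10.0):
    """Weight for one spectral sample in the noise-robust HWS.

    h_a is the harmonic confidence, i.e. the maximum of the STA criterion.
    """
    if is_harmonic and freq_hz <= cutoff_hz:
        return max(1.0, math.exp((h_a - eta) * gamma))
    return 1.0

strong = modified_harmonic_weight(1000.0, True, 0.8)      # clear harmonics
weak = modified_harmonic_weight(1000.0, True, 0.4)        # low confidence
above_cutoff = modified_harmonic_weight(3000.0, True, 0.8)
```

With these defaults, a harmonic confidence of 0.8 gives a weight of e^3 (about 20), while any confidence at or below the threshold 0.5 gives a weight of 1, so frames with weak harmonic structure fall back to the conventional smoothed spectrum without an explicit V/UV/T decision.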

[0077] Pre-compression Spectral Masking

[0078] One major shortcoming of logarithm-based approaches (including MFCC and PLP) is that the logarithm function is unbounded as its argument tends to zero. It is thus very sensitive to small input values. This may greatly deteriorate the representation robustness, as these low energy parts hold the worst SNR under noisy environments. A common noise reduction technique is to apply a lower bound to the original spectrum (D. H. Klatt, “A digital filter bank for spectral matching”, Proc. ICASSP'79, pp. 573-576, 1979) before the logarithm operation. We found that this technique may be beneficially applied to the within-filter amplitude compression.

[0079] If S(ω) is the original spectrum, the masking operation can be defined as

$\tilde{S}(\omega) = \max(S(\omega), c),$

[0080] where c is a very small value, which may either be a fixed number or vary depending on noise conditions.

[0081] Root-power Representation

[0082] Another modification to improve the performance of PHCC representation in noisy environments consists of replacing the intensity-loudness power law with

$\hat{S}(\omega) = [\tilde{S}(\omega)]^{\theta}$

[0083] where θ is the root-power factor. While it was previously set to a fixed value in clean speech recognition, it may now be adjusted to the noise environment.
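The lower-bound masking and the root-power compression can be chained as in the following sketch. The floor value, the exponent and the toy spectrum are hypothetical; the text does not give the values of c or θ used in the experiments.

```python
import math

def mask_and_compress(spectrum, floor, theta):
    """Clip each spectral sample from below at `floor`, then raise it to
    the root-power factor `theta`."""
    return [max(s, floor) ** theta for s in spectrum]

# Toy spectrum with an exact zero and a near-zero sample.
toy = [0.0, 1e-12, 8.0]
compressed = mask_and_compress(toy, floor=1e-6, theta=1.0 / 3.0)

# Every value is now strictly positive, so a later logarithm stays bounded.
log_values = [math.log(v) for v in compressed]
```

Clipping first means the very small, noise-dominated samples all map to the same bounded value, which limits their influence on the subsequent log filter energies.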

EXAMPLE 2

[0084] To test the performance of PHCC, experiments were first carried out on a database of speaker-independent isolated Mandarin digits collected in white and babble noise environment. The recognition task consists of 11 pronunciations representing 10 Mandarin digits from 0 to 9, with 2 different pronunciations for the digit “1” ([i] and [iao]). The database includes 150 speakers (75 male and 75 female) with one utterance per speaker. Of the 150 speakers, 60 male and 60 female speakers were selected at random for training, and the remaining 30 speakers were set aside for the test set.

[0085] In our experiment, 26-dimension speech features were used, including 12 cepstral (MFCC or PHCC) parameters, log energy, and their dynamics (time derivatives). We used an analysis frame of width 30 ms and step of 10 ms, and a Hamming window. A 9-state continuous-density HMM was used with a single Gaussian pdf per state. The experiment results for PHCC and MFCC are summarized in Tables 3 and 4.

TABLE 3. Test-set error rates of PHCC and MFCC for speaker-independent isolated Mandarin digit recognition under white noise environment

Front-end   Clean    20 dB    10 dB    0 dB
MFCC        2.1%     4.8%     16.9%    45.6%
PHCC        1.1%     2.9%     13.0%    29.1%

[0086] TABLE 4. Test-set error rates of PHCC and MFCC for speaker-independent isolated Mandarin digit recognition under babble noise environment

Front-end   Clean    20 dB    10 dB    0 dB
MFCC        2.1%     4.1%     13.3%    35.2%
PHCC        1.1%     2.3%     10.5%    27.4%

[0087] Table 3 shows that the error rate decreased by nearly 50% in the clean speech environment and by 23% to 36% in the white noise environment, and demonstrates consistent superiority of PHCC over MFCC at differing noise levels. Table 4 shows that a similar improvement of PHCC is achieved in the babble noise environment. The main source of errors in recognizing Mandarin digits is the confusion between vowels such as [a] and [e]. This is where the spectral envelope based PHCC substantially outperforms conventional MFCC, hence the significant and consistent gains observed in clean speech and noisy environments. The improvement in noisy environments is also attributed to the modified weight function for HWS, and to the within-filter root-power amplitude compression following the low-bound masking procedure.

[0088] Split-band PHCC

[0089] The advantage of PHCC over conventional acoustic analysis methods is mainly attributed to its spectral envelope estimation. It is widely recognized in the speech coding community that it is the spectral envelope and not the gross spectrum that represents the shape of the vocal tract. However, spectral envelope estimation may greatly reduce the representation accuracy and robustness in the case of non-harmonic sounds. In our early PHCC approach, the possible distortion due to spectral envelope extraction was mitigated by effective V/UV/T detection. Nevertheless, significant distortion was observed in voiced and transitional speech since the spectral envelope was estimated by HWS throughout the frequency domain. While HWS performs well in the harmonic region of the speech spectrum, it tends to impart an undesirable effect to noise-like non-harmonic regions and hence reduce robustness. To overcome this drawback, we propose the split-band PHCC (SB-PHCC), in which spectral envelope extraction is restricted to the harmonic band where the harmonic structure is rich and reliable, while conventional smoothed spectral estimation is applied to the non-harmonic band for higher representation robustness and accuracy.

[0090] A flowchart of the SB-PHCC formula is shown in FIG. 5. The speech signal undergoes discrete Fourier transformation, followed by root-power compression, as in plain PHCC. However, in SB-PHCC the STA formula is not only adopted for robust V/UV/T detection and pitch estimation, but also for split-band analysis by computing three new parameters, namely, harmonic confidence, voicing strength and cutoff frequency, which reflect the prominence of harmonic structures in the speech spectrum. These parameters, as well as the peak-picked harmonic locations, form the basis of split-band HWS estimation. The extracted mixed spectrum passes through mel-scaled band-pass filters, followed by discrete cosine transform, and results in the split-band perceptual harmonic cepstral coefficients.
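The final stage of the flowchart, the discrete cosine transform of the log filter energies, can be sketched as follows. The text does not state which DCT variant is used; this sketch assumes the unnormalized DCT-II that is customary for mel-cepstral front ends, and the function name and input values are illustrative only.

```python
import math


def cepstrum_from_log_energies(log_energies, num_coeffs):
    """Unnormalized DCT-II of the log filter energies (assumed DCT variant).

    c_n = sum_i E_i * cos(pi * n * (i + 0.5) / M), with i = 0..M-1.
    """
    m = len(log_energies)
    return [
        sum(e * math.cos(math.pi * n * (i + 0.5) / m)
            for i, e in enumerate(log_energies))
        for n in range(num_coeffs)
    ]


# Hypothetical log energies from four band-pass filters.
log_energies = [2.0, 1.5, 1.0, 0.5]
cepstrum = cepstrum_from_log_energies(log_energies, 3)

# Adding the same constant to every log energy changes only coefficient 0.
shifted = cepstrum_from_log_energies([e + 0.7 for e in log_energies], 3)
```

A constant offset in the log domain, such as the one produced by a change of input gain, therefore shows up only in the zeroth coefficient and leaves the higher-order coefficients untouched.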

[0091] Spectro-temporal Autocorrelation (STA)

[0092] Robust pitch estimation is critical for harmonic-based spectral envelope representation. Although small errors could be corrected by a peak-picking formula as described in, pitch multiple and sub-multiple errors can greatly reduce the accuracy of the spectral envelope for voiced sounds. One effective approach to eliminate such errors is the STA formula, which was first proposed for design of harmonic speech coders. In this paper, STA is further harnessed to measure the harmonic characteristics of the speech spectrum, via the computation of three new parameters described herein.

[0093] Given a speech signal s_(t)(n), the temporal auto-correlation (TA) for candidate pitch τ is defined as

$R^{T}(\tau) = \frac{\sum_{n=0}^{N-\tau-1} \tilde{s}_t(n) \cdot \tilde{s}_t(n+\tau)}{\sqrt{\sum_{n=0}^{N-\tau-1} \tilde{s}_t^{2}(n) \cdot \sum_{n=0}^{N-\tau-1} \tilde{s}_t^{2}(n+\tau)}}$

[0094] where $\tilde{s}_t(n)$ is the zero-mean version of s_(t)(n), and N is the number of samples for pitch estimation.
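The temporal auto-correlation can be evaluated directly from its definition. The sketch below is a plain-Python transcription under stated assumptions: the function name, the synthetic sine frame and its 40-sample period are illustrative, and a zero denominator is mapped to a score of 0.

```python
import math


def temporal_autocorrelation(frame, lag):
    """Normalized temporal auto-correlation of a frame at a candidate lag."""
    n_total = len(frame)
    if not 0 < lag < n_total:
        raise ValueError("lag must lie strictly between 0 and the frame length")
    mean = sum(frame) / n_total
    centred = [x - mean for x in frame]  # zero-mean version of the signal
    span = n_total - lag                 # n runs from 0 to N - lag - 1
    num = sum(centred[n] * centred[n + lag] for n in range(span))
    e0 = sum(centred[n] ** 2 for n in range(span))
    e1 = sum(centred[n + lag] ** 2 for n in range(span))
    denom = math.sqrt(e0 * e1)
    return num / denom if denom > 0 else 0.0


# Synthetic frame: a sine with a 40-sample period (illustrative only).
frame = [math.sin(2 * math.pi * n / 40) for n in range(240)]
r_at_period = temporal_autocorrelation(frame, 40)
r_at_half = temporal_autocorrelation(frame, 20)
r_at_double = temporal_autocorrelation(frame, 80)
```

The score at twice the true period is as high as the score at the period itself, which illustrates why a purely temporal criterion is prone to pitch multiple errors.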

[0095] Motivated by the pitch multiple errors that were observed in the conventional TA method, the spectral auto-correlation (SA) criterion was introduced and defined as

$R^{S}(\tau) = \frac{\int_{0}^{\pi-\omega_{\tau}} \tilde{S}_f(\omega)\, \tilde{S}_f(\omega+\omega_{\tau})\, d\omega}{\sqrt{\int_{0}^{\pi-\omega_{\tau}} \tilde{S}_f^{2}(\omega)\, d\omega \cdot \int_{0}^{\pi-\omega_{\tau}} \tilde{S}_f^{2}(\omega+\omega_{\tau})\, d\omega}}$

[0096] where ω_(τ)=2π/τ, S_(f)(ω) is the power spectrum of s_(t)(n), and $\tilde{S}_f(\omega)$ is the zero-mean version of S_(f)(ω). However, pitch sub-multiple errors may occur in SA. STA was devised to reduce both pitch multiple and sub-multiple errors, and is defined as:

$R(\tau) = \beta \cdot R^{T}(\tau) + (1-\beta) \cdot R^{S}(\tau) \qquad (1)$

[0097] where β=0.5 was reported to yield the best results in.
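A discrete sketch of the spectral criterion and of the combination in (1) is given below. It assumes the spectrum is sampled at K uniformly spaced bins over [0, π), so that the shift ω_τ = 2π/τ corresponds to 2K/τ bins. The comb spectrum is synthetic, and the temporal scores in `ta_scores` are made-up placeholder values for illustration, not measurements.

```python
import math


def spectral_autocorrelation(spectrum, lag):
    """Normalized auto-correlation of the spectrum at a shift of 2*pi/lag.

    Assumes `spectrum` holds K samples uniformly covering [0, pi).
    """
    k_total = len(spectrum)
    shift = round(2 * k_total / lag)  # 2*pi/lag expressed in bins
    if not 0 < shift < k_total:
        raise ValueError("lag maps to a shift outside the spectrum")
    mean = sum(spectrum) / k_total
    centred = [s - mean for s in spectrum]
    span = k_total - shift
    num = sum(centred[k] * centred[k + shift] for k in range(span))
    e0 = sum(centred[k] ** 2 for k in range(span))
    e1 = sum(centred[k + shift] ** 2 for k in range(span))
    denom = math.sqrt(e0 * e1)
    return num / denom if denom > 0 else 0.0


def sta(r_t, r_s, beta=0.5):
    """Spectro-temporal auto-correlation: weighted sum of TA and SA."""
    return beta * r_t + (1.0 - beta) * r_s


# Synthetic comb spectrum: a harmonic every 8 bins, i.e. a true lag of 32.
spectrum = [1.0 if k % 8 == 0 else 0.0 for k in range(128)]
candidate_lags = [16, 32, 64]
ta_scores = {16: 0.0, 32: 0.95, 64: 0.93}  # placeholder TA values (assumed)
sa_scores = {lag: spectral_autocorrelation(spectrum, lag) for lag in candidate_lags}
sta_scores = {lag: sta(ta_scores[lag], sa_scores[lag]) for lag in candidate_lags}
best_lag = max(sta_scores, key=sta_scores.get)
```

In this toy case SA rejects the double-period candidate (lag 64) but scores the half-period candidate (lag 16) as highly as the true lag, while the placeholder TA values do the opposite; only the combined score singles out lag 32.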

[0098] Harmonic Weighted Spectrum (HWS)

[0099] Spectral envelope representation is currently widely used in harmonic speech coding, and more recently in speech recognition. If the speech spectrum between pitch harmonics is smooth, interpolation or normalization methods can be used to retrieve the spectrum spline, albeit with high sensitivity to pitch estimation errors. Instead, we proposed an approach called harmonic weighted spectrum (HWS) estimation. Given S_(f)(ω), the magnitude spectrum of input speech, HWS is defined as

$HWS(\omega) = w_h(\omega) \cdot S_f(\omega)$

[0100] where w_(h)(ω) is the harmonic weighting function which was originally defined in as

$w_h(\omega) = \begin{cases} W_H, & \omega \text{ is pitch harmonic} \\ 1, & \text{otherwise} \end{cases} \qquad (2)$

[0101] where W_(H) was adjusted depending on the V/UV/T classification. It was set to a high value for voiced sounds and intermediate value for transitional sounds. The harmonic weighting function is modified in this paper, as will be explained next.
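The class-dependent weighting of (2) amounts to scaling the spectrum at the harmonic bins. In the sketch below the values 100 and 10 are the W_(H) settings recited in claims 6 and 7; the weight of 1 for unvoiced frames, the bin-index representation of the harmonics and all names are assumptions made for illustration.

```python
def harmonic_weighted_spectrum(spectrum, harmonic_bins, frame_class):
    """Scale the spectrum by W_H at pitch-harmonic bins, leave other bins as is."""
    # W_H per class: 100 and 10 follow claims 6 and 7; 1.0 for unvoiced
    # frames (no harmonic emphasis) is an assumption of this sketch.
    w_h_by_class = {"voiced": 100.0, "transitional": 10.0, "unvoiced": 1.0}
    w_h = w_h_by_class[frame_class]
    harmonics = set(harmonic_bins)
    return [w_h * s if k in harmonics else s for k, s in enumerate(spectrum)]


# Hypothetical magnitude spectrum with peak-picked harmonics at bins 2 and 4.
magnitude = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
hws_voiced = harmonic_weighted_spectrum(magnitude, [2, 4], "voiced")
hws_unvoiced = harmonic_weighted_spectrum(magnitude, [2, 4], "unvoiced")
```

Because the weighted spectrum is later integrated by the mel-scaled filters, a large W_(H) makes each filter output dominated by the harmonic samples, which is how the envelope rather than the gross spectrum is captured.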

[0102] Split-band Analysis

[0103] The PHCC harmonic weighting function w_(h)(ω) does not take into account the distortion of spectral envelope estimation at non-harmonic locations for both voiced and transitional speech. A split-band analysis is hence proposed here to eliminate this drawback. The underlying premise of this technique is that there exists a single transition frequency (the voicing cutoff frequency) below which the harmonic structure is rich and clear, and above which the spectrum is essentially non-harmonic. Therefore, for voiced and transitional sounds, the original spectrum is split into two bands: the (low frequency) harmonic band and the (high frequency) non-harmonic band. Given the differing characteristics of the two bands, potential gains are expected if they are treated separately. In the proposed SB-PHCC, HWS is implemented in the harmonic band, while MFCC is used in the non-harmonic band. Thus, the accuracy of the spectral envelope representation is maintained by harmonic-weighted spectral estimation, while the noise-sensitivity in the non-harmonic band is reduced by the smoothing procedure, where no harmonic-based analysis is necessary.

[0104] To carry out the split-band analysis, three new parameters are defined and computed to measure the prominence of the harmonic structures observed in the speech spectrum.

[0105] The prominence of the harmonic structure over the full band may be measured by the harmonic confidence, which is defined as

$H_a = \max_{\tau} R(\tau),$

[0106] where R(τ) is the STA defined above.

[0107] The prominence of the harmonic structure about frequency Ω can be measured by the voicing strength, which is defined as

$V_s(\Omega) = \frac{\int_{\Omega-\omega_0}^{\Omega} \tilde{S}_f(\omega)\, \tilde{S}_f(\omega+\omega_0)\, d\omega}{\sqrt{\left[\int_{\Omega-\omega_0}^{\Omega} \tilde{S}_f^{2}(\omega)\, d\omega\right] \left[\int_{\Omega}^{\Omega+\omega_0} \tilde{S}_f^{2}(\omega)\, d\omega\right]}}$

[0108] where ω₀ is the fundamental frequency.
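A discrete reading of the voicing strength correlates the spectrum over one pitch interval below Ω with the interval of the same width above Ω. The sketch below assumes the spectrum is given as bins, that ω₀ corresponds to a whole number of bins, and that the zero-mean spectrum is obtained by subtracting the global mean; the test spectrum (a harmonic comb followed by an unrelated ripple) is synthetic.

```python
import math


def voicing_strength(spectrum, centre_bin, pitch_bins):
    """Local normalized correlation between [c - p, c) and [c, c + p)."""
    lo, hi = centre_bin - pitch_bins, centre_bin + pitch_bins
    if lo < 0 or hi > len(spectrum):
        raise ValueError("window does not fit inside the spectrum")
    mean = sum(spectrum) / len(spectrum)
    centred = [s - mean for s in spectrum]  # zero-mean spectrum (global mean)
    lower = centred[lo:centre_bin]          # one pitch interval below centre
    upper = centred[centre_bin:hi]          # the same interval shifted by pitch
    num = sum(a * b for a, b in zip(lower, upper))
    denom = math.sqrt(sum(a * a for a in lower) * sum(b * b for b in upper))
    return num / denom if denom > 0 else 0.0


# Bins 0..63: harmonic comb with a pitch spacing of 8 bins.
harmonic_band = [1.0 if k % 8 == 0 else 0.0 for k in range(64)]
# Bins 64..127: a ripple with a 3-bin period, unrelated to the pitch spacing.
ripple = [0.2, 0.0, 0.1]
non_harmonic_band = [ripple[k % 3] for k in range(64, 128)]
spectrum = harmonic_band + non_harmonic_band

v_low = voicing_strength(spectrum, centre_bin=32, pitch_bins=8)
v_high = voicing_strength(spectrum, centre_bin=96, pitch_bins=8)
```

The measure is close to 1 inside the comb and drops below zero in the ripple region, which is the contrast the cutoff-frequency search relies on.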

[0109] The boundary between harmonic band and non-harmonic band is specified by a voicing cutoff frequency. The voicing cutoff frequency is recognized as an important quantity in speech coding, where a number of relevant techniques have been developed [D. W. Griffin and J. S. Lim, “Multiband Excitation Coder”, IEEE Trans. ASSP, vol. 36, pp. 1223-1235, 1988 and E. K. Kim and Y. H. Oh, “New analysis method for harmonic plus noise model based on time-domain periodicity score”, Proc. ICSLP, 2000]. Here we propose a formula based on the average voicing strength ratio between the harmonic band and the non-harmonic band, which can be described as

$\omega_T = \arg\max_{\omega_{T_l} < \omega < \omega_{T_h}} \frac{\left[\int_{\omega_{T_l}}^{\omega} V_s(\Omega)\, d\Omega\right] / \left(\omega - \omega_{T_l}\right)}{\left[\int_{\omega}^{\omega_{T_h}} V_s(\Omega)\, d\Omega\right] / \left(\omega_{T_h} - \omega\right)} \qquad (3)$

[0110] where ω_(T_l) and ω_(T_h) delimit the allowed interval for the cutoff frequency. In our experiment, we set ω_(T_l)=2000π and ω_(T_h)=6000π.
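The search in (3) can be sketched on a sampled voicing-strength curve: for each candidate split point, the average voicing strength below it is divided by the average above it, and the split with the largest ratio wins. The index-based grid, the small constant `eps` that guards against a non-positive denominator, and the sample values are assumptions of this sketch and are not specified in the text.

```python
def voicing_cutoff_index(voicing, lo, hi, eps=1e-6):
    """Return the split index in (lo, hi) that maximizes the ratio of
    average voicing strength below the split to the average above it."""
    best_index, best_ratio = None, float("-inf")
    for j in range(lo + 1, hi):
        low_avg = sum(voicing[lo:j]) / (j - lo)
        high_avg = sum(voicing[j:hi]) / (hi - j)
        ratio = low_avg / max(high_avg, eps)  # eps: assumed safeguard
        if ratio > best_ratio:
            best_index, best_ratio = j, ratio
    return best_index


# Hypothetical voicing-strength samples over the allowed interval:
# strongly harmonic in the first six bins, weakly harmonic afterwards.
voicing = [0.9] * 6 + [0.2] * 6
cutoff = voicing_cutoff_index(voicing, lo=0, hi=len(voicing))
```

Moving the split in either direction away from the true boundary mixes the two regions and lowers the ratio, so the maximum falls at the transition.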

[0111] We hence propose a new harmonic weighting function, which is substantially different from the one we used in plain PHCC. The SB-PHCC harmonic weighting function is defined as

$w_h(\omega) = \begin{cases} \max\left(1, e^{(H_a - \eta) \cdot \gamma}\right), & \text{if } \omega \le \omega_T \text{ is pitch harmonic} \\ 1, & \text{otherwise} \end{cases}$

[0112] where η is the harmonic confidence threshold, γ is the weight factor, and ω_(T) is the cut-off frequency. For voiced sounds, ω_(T) is obtained from (3). For transitional sounds, ω_(T) is fixed due to the reduced reliability of (3), which is compromised by low average voicing strength in the harmonic band. In our experiments, η and γ are set to 0.5 and 10, respectively, and ω_(T) is set to 4000π for transitional sounds.
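The split-band weighting can be sketched as below, with η = 0.5 and γ = 10 as reported in the experiments. The sketch assumes that the exponential in the weighting function has base e, and it represents harmonics and the cutoff as bin indices; the function name and example values are illustrative.

```python
import math


def split_band_weights(num_bins, harmonic_bins, cutoff_bin, h_a,
                       eta=0.5, gamma=10.0):
    """Weights that emphasize pitch harmonics at or below the cutoff bin.

    h_a is the harmonic confidence; eta and gamma default to the values
    reported in the text. Base-e exponentiation is an assumption here.
    """
    boost = max(1.0, math.exp((h_a - eta) * gamma))
    harmonics = set(harmonic_bins)
    return [
        boost if (k in harmonics and k <= cutoff_bin) else 1.0
        for k in range(num_bins)
    ]


# Harmonics at bins 2, 4 and 6; only those up to the cutoff bin are boosted.
confident = split_band_weights(8, [2, 4, 6], cutoff_bin=4, h_a=0.8)
doubtful = split_band_weights(8, [2, 4, 6], cutoff_bin=4, h_a=0.4)
```

With a harmonic confidence at or below the threshold the weights are all 1, so the front end falls back to unweighted filtering when the harmonic structure is not trustworthy.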

[0113] Within-filter Amplitude Compression

[0114] The perceptual amplitude compression procedure we developed for plain PHCC is applied in SB-PHCC to reduce amplitude variation, and is summarized here for completeness.

[0115] It is widely recognized that auditory properties can be exploited to improve automatic speech recognition. One example is the perceptual transformation of the spectrum amplitude, which is handled in radically different ways by the leading acoustic analysis systems. PLP applies the equal-loudness curve and the intensity-loudness power law to better exploit knowledge about the auditory system, but requires scale normalization, which was experimentally found to have a critical impact on the overall recognition performance. MFCC sacrifices some perceptual precision and circumvents this difficulty by approximating the auditory curve with a logarithmic function that offers the elegant level-invariance properties.

[0116] In an effort to enjoy the best of both approaches, we apply the intensity-loudness power law within each filter and compute the log energy over all filters. Hence,

$\hat{S}(\omega) = [S(\omega)]^{\theta}$

$\hat{E}_i = \log(E_i), \quad 1 \le i \le M$

[0117] where Ŝ(ω) is the compressed spectrum, Ê_(i) is the log energy for band-pass filter i, and θ is the root-power factor. The resulting spectrum representation can significantly reduce the amplitude variation within each filter, without degradation of the gain-invariance properties and, since the filter energy levels are still represented in logarithmic scale, without recourse to normalization.

[0118] The cubic-root amplitude compression (θ=⅓) selected in was found to perform best in our clean speech experiment. It was, however, not optimal in our noisy speech experiment. Instead, we vary θ with SNR to achieve improved performance (θ is set to ⅔ for very low SNR).
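The gain-invariance property can be checked numerically. The sketch below assumes that the filter energy E_i is the filter-weighted sum of the compressed spectrum, which the text does not spell out; the two triangular filters and the spectrum values are toy data.

```python
import math


def log_filter_energies(spectrum, filters, theta):
    """Root-power compression within each filter, then log filter energy.

    Assumes E_i = sum_k filter_i[k] * S[k] ** theta.
    """
    compressed = [s ** theta for s in spectrum]
    return [
        math.log(sum(w * c for w, c in zip(weights, compressed)))
        for weights in filters
    ]


# Toy spectrum and two overlapping triangular filters (illustrative values).
spectrum = [1.0, 8.0, 27.0, 64.0, 8.0, 1.0]
filters = [
    [0.5, 1.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5, 0.0],
]
theta = 1.0 / 3.0

base = log_filter_energies(spectrum, filters, theta)
louder = log_filter_energies([8.0 * s for s in spectrum], filters, theta)
shifts = [b - a for a, b in zip(base, louder)]
```

Scaling the input by a gain g moves every log energy by the same amount, θ·log(g) (here log 2), so the relative levels between filters are preserved and no explicit normalization is needed.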

EXAMPLE 3 Results

[0119] To test the performance of SB-PHCC, experiments were first carried out on a database of speaker-independent isolated Mandarin digits collected in an office environment. The recognition task consists of 11 pronunciations representing 10 Mandarin digits from 0 to 9, with 2 different pronunciations for the digit “1” ([i] and [iao]). The database includes 150 speakers (75 male and 75 female), with one utterance per speaker. Of the 150 speakers, 60 male and 60 female speakers were selected at random for training, and the remaining 30 speakers were set aside for the test set.

[0120] In our experiments, 39-dimension speech features were used, including 12 cepstral parameters, log energy, and their first-order and second-order dynamics (time derivatives). We used an analysis frame of width 30 ms and step of 10 ms, and a Hamming window. A 9-state continuous-density HMM was used with a single Gaussian pdf per state. The experimental results for MFCC, PHCC and SB-PHCC are summarized in Table 5. It shows a substantial decrease in error rate from MFCC, through PHCC, to SB-PHCC, for both male and female speakers.

[0121] To further test the performance of SB-PHCC on unvoiced sounds, additional experiments were carried out on OGI's E-set database. The recognition task is to distinguish between nine highly confusable English letters {b, c, d, e, g, p, t, v, z}, where the vowels are of minimal significance to the classification task. The database was generated by 150 speakers (75 male and 75 female) and includes one utterance per speaker. The experimental results are summarized in Table 6. SB-PHCC achieved consistently better results than PHCC over a range of acoustic model complexities, and offers over 15% error reduction relative to MFCC.

TABLE 5
Test-set error rate of MFCC, PHCC and SB-PHCC on isolated Mandarin digit recognition

Speaker Gender   Male    Female   Male & Female
MFCC             0.6%    3.9%     2.9%
PHCC             0.4%    2.4%     1.8%
SB-PHCC          0.3%    1.9%     1.4%

[0122] TABLE 6
Test-set error rate of MFCC, PHCC and SB-PHCC on the E-set

Acoustic Models   7-state CHMM   13-state CHMM   21-state TMHMM
MFCC              15.3%          11.0%           7.3%
PHCC              12.2%          9.0%            6.2%
SB-PHCC           11.3%          8.5%            5.8%

[0123] The following references are incorporated herein by reference: M. J. Hunt, “Spectral signal processing for ASR”, Proc. ASRU'99, December 1999; S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuous spoken sentences”, IEEE Trans. Acoust., Speech, Signal Processing, pp. 357-366, vol. 28, August 1980; H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech”, J. Acoust. Soc. America, pp. 1738-1752, vol. 87, no. 4, April 1990; M. Jelinek and J. P. Adoul, “Frequency-domain spectral envelope estimation for low rate coding of speech”, Proc. ICASSP'99, pp. 253-256, 1999; Y. D. Cho, M. Y. Kim and S. R. Kim, “A spectrally mixed excitation (SMX) vocoder with robust parameter determination”, Proc. ICASSP'98, pp. 601-604, 1998; Q. Zhu and A. Alwan, “AM-demodulation of speech spectra and its application to noise robust speech recognition”, Proc. ICSLP'2000, October 2000; L. Gu and K. Rose, “Perceptual harmonic cepstral coefficients as the front-end for speech recognition”, Proc. ICSLP'2000, October 2000; D. H. Klatt, “A digital filter bank for spectral matching”, Proc. ICASSP'79, pp. 573-576, 1979; H. K. Kim and H. S. Lee, “Use of spectral autocorrelation in spectral envelope linear prediction for speech recognition”, IEEE Trans. Speech and Audio Processing, vol. 7, no. 5, pp. 533-541, 1999; L. Gu and K. Rose, “Perceptual harmonic cepstral coefficients for speech recognition in noisy environment”, Proc. ICASSP'2001, May 2001; D. W. Griffin and J. S. Lim, “Multiband Excitation Coder”, IEEE Trans. ASSP, vol. 36, pp. 1223-1235, 1988; and E. K. Kim and Y. H. Oh, “New analysis method for harmonic plus noise model based on time-domain periodicity score”, Proc. ICSLP, 2000.

[0124] Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity and understanding, it will be obvious that various modifications and changes which are within the knowledge of those skilled in the art are considered to fall within the scope of the invention.

1. A speech recognition method using a perceptual harmonic cepstral coefficient comprising: a) processing a speech frame whereby to obtain a short-term power spectrum; b) performing a robust pitch estimation; c) using a peak-picking formula whereby to obtain a pitch harmonic; d) applying class-dependent harmonic weighting whereby to obtain the harmonics weighted spectrum; e) applying a mel-scaled filter to the harmonics weighted spectrum; and f) computing the log energy output which is transformed into cepstrum by the discrete cosine transform.
2. The method of claim 1 wherein the processing of said speech frame is by Fast Fourier transform or Discrete Fourier transform.
3. The method of claim 1 in which the step of performing the robust pitch estimation is in accordance with the formula:

$R(\tau) = \beta \cdot R^{T}(\tau) + (1-\beta) \cdot R^{S}(\tau)$

wherein β=0.5, wherein

$R^{T}(\tau) = \frac{\sum_{n=0}^{N-\tau-1} \tilde{s}_t(n) \cdot \tilde{s}_t(n+\tau)}{\sqrt{\sum_{n=0}^{N-\tau-1} \tilde{s}_t^{2}(n) \cdot \sum_{n=0}^{N-\tau-1} \tilde{s}_t^{2}(n+\tau)}}$

and N is the number of samples for pitch estimation, wherein

$R^{S}(\tau) = \frac{\int_{0}^{\pi-\omega_{\tau}} \tilde{S}_f(\omega) \cdot \tilde{S}_f(\omega+\omega_{\tau})\, d\omega}{\sqrt{\int_{0}^{\pi-\omega_{\tau}} \tilde{S}_f^{2}(\omega)\, d\omega \cdot \int_{0}^{\pi-\omega_{\tau}} \tilde{S}_f^{2}(\omega+\omega_{\tau})\, d\omega}}$

and ω_(τ)=2π/τ, S_(f)(ω) is the magnitude spectrum of s_(t)(n), and $\tilde{S}_f(\omega)$ is the zero-mean version of S_(f)(ω).
4. The method of claim 3 wherein the step of performing the robust pitch estimation allows for classification of the speech as voiced, unvoiced, or transitional using the spectro-temporal auto-correlation criterion R(τ), such that if R(τ)>α_(v), the speech frame is classified as voiced, if R(τ)<α_(u), the speech frame is classified as unvoiced, and if α_(v)≧R(τ)≧α_(u), the speech frame is declared transitional, wherein α_(v)=0.8 and α_(u)=0.5.
5. The method of claim 1, in which the step of obtaining the harmonics weighted spectrum is in accordance with the formula:

$HWS(\omega) = w_h(\omega) \cdot S_f(\omega)$

wherein w_(h)(ω)=W_(H) or 1 and S_(f)(ω)=the magnitude spectrum of input speech.
6. The method of claim 5, wherein W_(H)=100 for voiced sounds.
7. The method of claim 5, wherein W_(H)=10 for transitional sounds.
8. The method of claim 1, in which the step of applying mel-scaled filters and computing the log energy output is in accordance with the formula:

$\hat{S}(\omega) = [S(\omega)]^{1/3}$

$\hat{E}_i = \log(E_i), \quad 1 \le i \le M$

wherein Ŝ(ω) is the compressed spectrum and Ê_(i) is the log energy for band-pass filter i.
9. The method of claim 1, further comprising: prior to performing the robust pitch estimation, applying the intensity-loudness power law to the power spectrum to obtain a root-power compressed spectrum.
10. A speech recognition method using a harmonic weighting function in accordance with the formula:

$w_h(\omega) = \begin{cases} \max\left(1, e^{(H_a - \eta) \cdot \gamma}\right), & \text{if } \omega \le \omega_T \text{ is pitch harmonic} \\ 1, & \text{otherwise} \end{cases}$

wherein $H_a = \max_{\tau} R(\tau)$ is the harmonic confidence, η is the harmonic confidence threshold, γ is the weight factor, R(τ) is the spectro-temporal autocorrelation criterion, and ω_(T) is the cut-off frequency.
11. The method of claim 10 wherein ω_(T), η and γ are about 2.5 kHz, 0.5 and 10, respectively.
12. A speech recognition method using a harmonic weighting function in accordance with the formula:

$w_h(\omega) = \begin{cases} \max\left(1, e^{(H_a - \eta) \cdot \gamma}\right), & \text{if } \omega \le \omega_T \text{ is pitch harmonic} \\ 1, & \text{otherwise} \end{cases}$

wherein $H_a = \max_{\tau} R(\tau)$ is the harmonic confidence, η is the harmonic confidence threshold, γ is the weight factor, R(τ) is the spectro-temporal autocorrelation criterion, and ω_(T) is the cut-off frequency.
13. The method of claim 12 wherein η and γ are 0.5 and 10, respectively.
14. The method of claim 12 wherein ω_(T) is 4000π for transitional sounds.
15. The method of claim 12 wherein ω_(T) is obtained for voiced sounds in accordance with the formula:

$\omega_T = \arg\max_{\omega_{T_l} < \omega < \omega_{T_h}} \frac{\left[\int_{\omega_{T_l}}^{\omega} V_s(\Omega)\, d\Omega\right] / \left(\omega - \omega_{T_l}\right)}{\left[\int_{\omega}^{\omega_{T_h}} V_s(\Omega)\, d\Omega\right] / \left(\omega_{T_h} - \omega\right)}$

wherein

$V_s(\Omega) = \frac{\int_{\Omega-\omega_0}^{\Omega} \tilde{S}_f(\omega)\, \tilde{S}_f(\omega+\omega_0)\, d\omega}{\sqrt{\left[\int_{\Omega-\omega_0}^{\Omega} \tilde{S}_f^{2}(\omega)\, d\omega\right] \left[\int_{\Omega}^{\Omega+\omega_0} \tilde{S}_f^{2}(\omega)\, d\omega\right]}}$

wherein ω_(T_l) and ω_(T_h) delimit the allowed interval for the cutoff frequency.
16. The method of claim 15 wherein ω_(T_l) is 2000π and ω_(T_h) is 6000π.